home *** CD-ROM | disk | FTP | other *** search
-
-
- Diff - A Change Bar Tool
-
-
- by Peter van Es CompuServe 72677,1417
- and mods by Steve Kraus 71531,2062
-
- As a person who has to maintain large amounts of
- documentation and needs to be able to track changes through
- successive revisions of the same document, I quickly felt the
- need for a tool which would insert change bars into
- documents.
-
- Remembering an article in Dr. Dobb's Journal, August 1987, by
- Don Krantz, I found a utility written in C which would insert
- Change Bars into existing document files in normal ASCII,
- exactly what I needed. I entered the program on my system,
- converting it to Turbo C (mainly using the Turbo C library
- routines for string comparisons) and tested it. Whereas the
- program worked fine on small text files, I had files for a
- document of 900 K text to process. The program's limitations
- became painfully obvious very rapidly. As it was written,
- the program could not:
-
- - deal with differences of over 200 lines, since the
- resync module only looked 200 lines ahead;
-
- - handle large text files since both uppercase and normal
- versions of the line were stored in memory.
-
- Furthermore the program was very slow on dealing with small
- changes (1 line inserted or deleted) due to the way the
- resynchronization algorithm was implemented.
-
- Rather than re-designing the program I went through a series
- of revisions on the program, tackling problems in small
- chunks. Perhaps it would have been better to re-design the
- program, but the resulting program has solved all of these
- limitations, and works just fine for my needs. Some of the
- existing limitations are documented at the end of this
- document.
-
- Why have I documented the change process here? Well, I
- dislike two things:
-
- - reinventing the wheel;
-
- - optimizing programs by delving into optimizers, or
- transforming parts into assembler routines.
-
- I believe in optimizing programs through the use of
- different, more efficient algorithms. Changing a high level
- language into assembler in selected places may give you a 10%
- speed increase, or, if you are lucky, a 50% increase. By
- altering the algorithm you can achieve improvements of 100%
- to 1000% much more easily (depending on the problem, of
- course). I hope that my thought processes might be of use to
- someone else, which is why I document them in this document.
-
-
- Page 1
-
-
-
-
-
- Diff - A Change Bar Tool
-
-
-
- The Improvements
-
-
- The first problem was easy to solve. The original program
- stored a line read into memory both in normal format and in
- uppercase format to allow case insensitive change bars to be
- inserted. Since Turbo C provides a subroutine to compare two
- strings either with case sensitivity (strcmp()) or without
- (stricmp()) the memory use of the program was halved at the
- cost of some processing speed (deciding which one of the two
- comparison routines to use). Since these routines have been
- optimized by Borland, they actually performed faster than the
- routine that was originally in the program.
-
- The program included a good memory management scheme, which
- ensured that every line of text was only read once, and
- allocated memory would be freed at the earliest possible
- opportunity, when a match has been found. Since this scheme
- is quite efficient, I decided to keep it as an essential part
- of the program.
-
- Upon inspection the program was still CPU bound, spending
- most of its time in string comparisons, with the disk drive
- being idle. In order to eliminate the time on most of these
- comparisons, where the same line of text is compared over and
- over again with other lines, a simple hash value of the
- characters in the line is added to the structure containing
- the line. This hash value, or checksum, is an exclusive or
- of all the characters in the line. Instead of comparing
- lines, the program now compares the pre-calculated checksums,
- and only if they are equal, will it compare the line
- character for character. This reduced the amount of
- processing on similar lines considerably.
-
- However, the program was still very slow, which was very
- noticeable on files with lots of small changes. Furthermore,
- the program couldn't cope with changes of more than 200 lines
- of text. Checking the program revealed why this was so. The
- program would always take a line from file 1, and look up to
- 200 lines ahead in file 2, to find the matching line. Then
- it would take line 2 of file 1, and look 200 lines ahead from
- the current line of file 2 again. And so on, until it had
- found a minimum of re-sync (which is a command line
- modifiable value, normally set at 5) of un-changed lines.
- This meant that in order to detect a 1 line change, the
- program could perform up to a 1000 line comparisons.
-
- Since most changes two documents are small ones, I modified
- the program so that it now starts by checking only 10 lines
- ahead. If that fails, it looks 20 lines ahead, then 40, 80
- and so on until it runs out of memory. There is no upper
- limit on the number of lines, other than constrained by
- memory. As a result, a small change requires far fewer
- comparisons and is detected far more rapidly.
-
-
- Page 2
-
-
-
-
-
- Diff - A Change Bar Tool
-
-
-
-
- Big changes can also be detected, since the program does not
- have the artificial constraint of 200 lines look ahead
- anymore. If it doesn't match on 80 lines it tries 160, then
- 320 and so on until it runs out of memory. Since the memory
- management function is quite efficient (discarding lines as
- soon as a match has been found) quite a few lines can be
- stored in memory. That is also why the compact compilation
- model is used, maximizing the amount of allocatable memory
- for a small program.
-
- There still turned out a problem with this. The program
- could now almost always synchronize, but on extremely large
- changes this was still very slow. The reason for this is,
- that every time the lookahead number of lines is doubled, the
- program has to start at line 1 of file 1 again, but now
- checking half the number of lines in file 2 for the second
- time, and half for the first time. Ie, the same line would
- be compared against the same line lots of times anyway. In
- order to minimize the amount of repeat lookups which occur
- every time that the lookahead lines are increased, the
- program now assumes that after 40 lines couldn't match, the
- change is probably very big, and the lookahead is set to 400
- lines. If it still doesn't match, it sets them at 800, 1600
- etc until it runs out of memory.
-
- Regrettably, although this helped a little bit, it was only a
- little bit, since for each of the 400 lines in file 1 each
- and everyone of the 400 lines in file 2 are checked, leaving
- the program still CPU bound.
-
- Then a brainwave hit. If we can use the checksum (a value
- between 0 and 255) to compare lines in the files, why can't
- we use it to quickly locate a line already read in for file
- 2? So an array was created, with an entry for each possible
- checksum value. This entry is the start of an ordered linked
- list, pointing to lines with the same checksum. Rather than
- having to check all lookahead lines in order to resync, the
- program uses the checksum value of the line in file 1 to
- index into the array of linked lists. It then traverses the
- list of lines with the same checksum, comparing the lines
- until one has been found which is equal. If none are found,
- the program takes the next line from file 1, and repeats the
- process. If no match can be found at all, the program
- restarts the lookahead procedure with twice the number of
- lines to look ahead. In order to ensure that sufficient
- lines are present, lookahead lines are read every time, which
- means that the memory management still works and memory gets
- cleared every time a synchronization has been detected. At
- one point a resync will either be found, or end of file
- reached (or memory is full -- this will not happen soon since
- matched lines are discarded from memory, so the change would
- have to exceed the size of memory).
-
-
-
- Page 3
-
-
-
-
-
- Diff - A Change Bar Tool
-
-
-
- However, since file 2 is not scanned sequentially but
- randomly, the last line before a match (which is used to be
- able to produce the "deleted from file 2" difference list) is
- not available any more. In order to allow for this, the line
- structure now includes a pointer to the previous line, so
- that the difference summary can still be printed.
-
- The nextline() and discard() functions maintain the line
- structure list, and have been modified to maintain the
- checksum array and the previous line list as well. Resync
- has been altered substantially in order to take advantage of
- the array. Minor modifications have been made to other
- routines in order to provide extra error checking.
-
- As a result of these modifications the program now is much
- faster especially with lots of small changes (due to the
- lookahead line policy), and all large changes (due to the
- checksum linked list). So much so, in fact, that it is also
- I/O bound (ie the disk drive is almost operating
- continuously). And it can handle bigger changes. The
- program now can put change bars into my 900 K of document
- faster than my word processor can print it.
-
-
- Limitations of the program.
-
- The main limitation of the program is due to the resynchro-
- nization algorithm. The program considers a file
- resynchronized if it discovers a number of matched lines.
- This number is set at 3 as a default, but can be modified
- using the -r option. However, the results of this algorithm
- can be surprising. When text has been duplicated (with more
- than resync lines) and sandwiched in between new text, and
- the old text has been left unchanged after the inserted text
- (as in the following example), a match will be detected. In
- the example assume that resync is 3.
-
- file 1 file 2
-
- A P
- B _______ Q
- C \ R
- D ______ \__________________ B
- E ____ \ C
- F \ \___________________ D
- G \ S
- H _ \ T
- I \ \ B
- \ \ C
- \ \ D
- \ \________________ E
- A false match... F
- \ G
- \___________________ H
-
-
- Page 4
-
-
-
-
-
- Diff - A Change Bar Tool
-
-
-
- As can be seen above, BCD in file 1 will match with the first
- occurrence of BCD in file 2. And as a result, the second BCD
- in file 2 will not match with the BCD in file 1. (The
- remainder, EFGH will match again). A more logical match
- would have been to leave the PQBCDRST sequence as a new
- insert, and just match on the sequence BCDEFGH. (Note that
- the first match could have been avoided by selecting the
- number of lines to be resynced to be larger than 3).
-
- A different algorithm, called the "longest common
- subsequence" algorithm, will do just that automatically. It
- will select the longest matching sequence in the file and use
- that as a basis for selecting the second longest matching
- sequence both above and below the longest matching sequence
- just found, and so on until all sequences have been found.
- This method is also particularly suitable for implementation
- using hashing. However, a disadvantage of this algorithm is
- that no lines can be discarded until both files have been
- processed in their entirety. As a result either lines have
- to be kept on disk (for instance storing the lseek() value in
- the line buffer with the checksum, so that lines can be
- retrieved from disk using direct access), or the size of the
- files that can be processed is limited. Furthermore, since
- the program will recursively check for matching subsequences
- (compare it for instance with CAR Hoare's QuickSort
- algorithm) this algorithm will produce better difference
- reports only at the cost of much more computer time. Thus
- this algorithm is used most often on mini-computer systems.
-
- Note that a mix between the "longest common subsequence"
- algorithm as described above, and the "scan until next match"
- as implemented in the program presented here can be
- implemented to good effect. Rather than having a minimum
- number of lines before resync is achieved, select a maximum
- number of lines after which a matching subsequence is
- considered the longest, causing the file to be split. By
- setting this number as one larger than the largest sequence
- which occurs more than once in a file (eg 100 lines) the
- program will produce the same results as the unconstrained
- version of the program, in less processing time.
-
- A final limitation of the program is due to the fact that it
- compares files on a line by line basis. This is fine for for
- instance program source code, but gives problems with most
- automatic paragraph reformatting features available in word
- processors. The result of inserting a word in one sentence
- can often be a re-format of several lines of text in the
- paragraph. Although in reality those sentences have not
- changed, the program will think they are changed.
-
- The solution is to perform the comparison on a sentence by
- sentence (or word by word) basis rather than line by line.
- The difficulty is however not in the synchronization
- algorithm (since that can stay as it is), but in the
-
-
- Page 5
-
-
-
-
-
- Diff - A Change Bar Tool
-
-
-
- algorithm which inserts change bars on the correct line in
- the correct place in the document. You would have to store a
- representation of the original line of text somewhere, and
- use a pointer to associate every word with the line from
- which it came. This modification is left as an exercise for
- the reader, however. If I ever get to it, I'll put the
- resulting code in Public Domain again.
-
-
- Using the diff program.
-
- diff - text file differencer and change barrer
- usage: diff [option{option}] newfile oldfile [barfile]
-
- Where the options are optional and can be inserted in any order
- except for the -n and -o option (which must appear in that
- order). Newfile and Oldfile are required (you don't want to
- compare a file against nothing, do you?). Barfile is optional,
- used if you want a file with change bars in it as output.
-
- The options are:
-
- -t trace operation, default off
- -b n change Bar column in barfile, default 78
- -h n Header lines to skip at top of page, default 0
- -f n Footer lines to skip at bottom of page, default = 0
- -p n Page size (embedded form feeds override) default = 66
- -c Case. uppercase/lowercase is significant (default is off)
- -r n Resync lines that must match before files are considered
- synced after differences are found. default = 3
- -m n left Margin to start comparison, default = 0
- -n Number display line number instead of page & line
- number on output (default is pages/line numbers)
- -w White blank lines are considered significant (default is not)
- -s Screen listing off (default is on)
- -u n Updated NEWFILE pages to skip before compare. default = 0,
- also sets -o.
- -o n OLDFILE pages in skip before compare. Must come after -u.
- default = 0
-
- The Trace option (-t) only shows if the program has been
- compiled with the debug option on. It shows debug messages.
-
- The Bar-column option (-b) selects the column in which the
- vertical bar will be placed. The default is column 78.
- Column 0 will put it in the left most position on the page.
- If your program contains embedded tabs, these are not
- expanded automatically to ensure that when the bar should
- appear in column 78, it actually does appear there. So use
- the option -b 0 in this case.
-
-
- Page 6
-
-
-
-
-
- Diff - A Change Bar Tool
-
-
-
- The Header option (-h) and Footer option (-f) allow you to
- specify the number of lines to skip at the top of the page
- and the bottom of the page for headers and footers. You
- don't usually want change bars just because the date or page
- numbering of a file has changed.
-
- The Page length option (-p) allows you to specify the number
- of lines per page. 66 is the Default.
- The Case Sensitive (-c) option will insert change bars if
- just the case of text has changed. The default is no case
- sensitivity, so that for instance changing "New york" into
- "New York" does not generate a change bar. For case
- sensitive language's program listings you'll want to use the
- -c option.
-
- The Resync option (-r) specifies the number of lines that
- must match before the files are considered re-synced. The
- limitations of this parameter have been discussed above. The
- default is 3. For program source code you'll want to
- experiment with slightly larger values.
-
- The White option (-w) makes blank lines significant. The
- default is that they are not significant. I never use this
- option, but perhaps you have a need for it.
-
- The Screen option (-s) turns off the screen listing of the
- differences. This speeds up the program when you are not
- interested in the differences list, or when you use the
- program in a batch file. Note that the program will give an
- error message if you specify the -s option and no bar file
- (since then all it would do is waste computer time). The
- differences list to screen has the following format:
-
- 002:12<This text has been added to the newfile on page 2,
- 002:13<lines 12 and 13.
- 002:12>This line has been deleted from oldfile, page 2 and
- 002:13>lines 12 and 13 and 14. Note that every time a match
- 002:14>has been found a blank line is printed first.
-
- (Note the direction of the arrows: < for insertion into the
- newfile, >for deletion (extraction if you like) from the
- oldfile to the newfile).
-
- The Newfile skip option (-n) and the Oldfile skip option (-o)
- allow you to skip pages at the beginning of the file for
- header pages and contents lists and the like, where you are
- not interested in change bars. Note that the -n option sets
- the -o option (unless you override the -o setting by
- explicitly including it in case contents lists are different
- lengths).
-
-
- Note that a space is required between a switch character and
- its value. Use "-r 5", not "-r5" or "/r5". - SK
-
-
-
- Page 7
-
-
-
-
-
- Diff - A Change Bar Tool
-
-
- Notes on Modifications by Steve Kraus. First, let me say
- thanks for the invaluable tool by Don Krantz and extensively
- modified by Peter van Es. I have used this program often.
- But I am a programmer/analyst by profession. And DIFF is
- designed to compare documentation on a page/line basis
- instead of a program source code basis.
-
- I often redirect the output from DIFF to another file or Vern
- Buerg's LIST utility for perusal. DIFF originally wrote all
- of its output and error messages to STDOUT. I added usage of
- STDERR for the program version and error message lines, with
- 'screen' output and the help message written to STDOUT.
-
- When I first started using DIFF, I had trouble using the
- switch options until I browsed through the source code. I am
- accustomed to using the forward slash switch character. So I
- added recognition for the additional "/" switch character and
- recognition of the upper and lower case switches. I modified
- the help message to supply the keywords matching the options,
- such as 'Screen' for -s.
-
- I'm more interested in the line number instead of the page
- number. Several PC editors enable you to start right at a
- given line number. I wanted the difference display to show
- line numbers instead of the page : line combination. So I
- added the line numbering switch, called "-n" in honor of DOS
- utilities FIND, GREP, and FC. I renamed the old -n switch to
- -u for Update file (sorry, Peter)
-
- I've been writing test set diagnostic software for HP9000
- Basic systems. The HP Basic editor is miserable. After
- transferring the saved Ascii files over to PC for editting
- and printing, I would run DIFF on program files to verify the
- program version changes. Unfortunately, HP Basic renumbers
- source lines automatically for inserted lines, changing the
- content of every line. So I devised the -m n Margin switch,
- so I could ignore the mandatory Basic line numbers at the
- beginning of every line. The checksum and line comparison
- begins at the "/M 5" left margin column, if specified.
-
- Plans for the next DIFF? Another logical change (for the
- future) would be to add a right margin column too. I'd like
- to modify option switch processing to allow the option value
- to be catenated against the switch name. That is, allow
- switches of the format: -b5 or /R7 instead of requiring
- the space between the value and the switch name: -b 5 -r 7
-